Introduction

Prosper Marketplace is America’s first peer-to-peer lending marketplace, with over $7 billion in funded loans. Borrowers request personal loans on Prosper and investors (individual or institutional) can fund anywhere from $2,000 to $35,000 per loan request. Investors can consider borrowers’ credit scores, ratings, and histories and the category of the loan. Prosper handles the servicing of the loan and collects and distributes borrower payments and interest back to the loan investors.

Prosper verifies borrowers’ identities and select personal data before funding loans and manages all stages of loan servicing. Prosper’s unsecured personal loans are fully amortized over a period of three or five years, with no pre-payment penalties. Prosper generates revenue by collecting a one-time fee on funded loans from borrowers and assessing an annual loan servicing fee to investors.

The dataset is provided by Udacity as part of the project work for its Nanodegree program. We will explore the dataset using the R language. Let’s look at the structure of the dataset first.

## Dimensions : 
##  Rows: 113937     Columns:  81

The dataset has 113,937 records and 81 columns. Exploring all 81 columns would be a tremendous task, so we will limit the analysis to about 20 of them. The variables used in the analysis are described below.

  • Term: The length of the loan expressed in months.
  • LoanStatus: The current status of the loan.
    • Cancelled
    • Chargedoff
    • Completed
    • Current
    • Defaulted
    • FinalPaymentInProgress
    • PastDue: The PastDue status will be accompanied by a delinquency bucket.
  • ProsperRating (Alpha): The Prosper Rating assigned at the time the listing was created between AA - HR. Applicable for loans originated after July 2009.
  • ProsperRating (numeric): The Prosper Rating assigned at the time the listing was created. Applicable for loans originated after July 2009.
    • 0 - N/A
    • 1 - HR
    • 2 - E
    • 3 - D
    • 4 - C
    • 5 - B
    • 6 - A
    • 7 - AA
  • ProsperScore: A custom risk score built using historical Prosper data. The score ranges from 1-11, with 11 being the best, or lowest risk score. Applicable for loans originated after July 2009.
  • Occupation: The Occupation selected by the Borrower at the time they created the listing.
  • CreditScoreRangeLower: The lower value representing the range of the borrower’s credit score as provided by a consumer credit rating agency.
  • CreditScoreRangeUpper: The upper value representing the range of the borrower’s credit score as provided by a consumer credit rating agency.
  • BankcardUtilization: The percentage of available revolving credit that is utilized at the time the credit profile was pulled.
  • AvailableBankcardCredit: The total available credit via bank card at the time the credit profile was pulled.
  • DebtToIncomeRatio: The debt to income ratio of the borrower at the time the credit profile was pulled. This value is Null if the debt to income ratio is not available. This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%).
  • IncomeRange: The income range of the borrower at the time the listing was created.
  • StatedMonthlyIncome: The monthly income the borrower stated at the time the listing was created.
  • ListingCategory: The category of the listing that the borrower selected when posting their listing.
    • 0 - Not Available
    • 1 - Debt Consolidation
    • 2 - Home Improvement
    • 3 - Business
    • 4 - Personal Loan
    • 5 - Student Use
    • 6 - Auto
    • 7 - Other
    • 8 - Baby&Adoption
    • 9 - Boat
    • 10 - Cosmetic Procedure
    • 11 - Engagement Ring
    • 12 - Green Loans
    • 13 - Household Expenses
    • 14 - Large Purchases
    • 15 - Medical/Dental
    • 16 - Motorcycle
    • 17 - RV
    • 18 - Taxes
    • 19 - Vacation
    • 20 - Wedding Loans
  • EmploymentStatus: The employment status of the borrower at the time they posted the listing.
  • BorrowerAPR: The Borrower’s Annual Percentage Rate (APR) for the loan.
  • BorrowerRate: The Borrower’s interest rate for this loan.
  • EstimatedReturn: The estimated return assigned to the listing at the time it was created. Estimated return is the difference between the Estimated Effective Yield and the Estimated Loss Rate. Applicable for loans originated after July 2009.
  • Investors: The number of investors that funded the loan.
  • LoanOriginalAmount: The origination amount of the loan.

Data Wrangling

Before we begin our analysis we need to make some changes to our dataset. We will first filter our dataset, so that it will have just the columns we need.

## 'data.frame':    113937 obs. of  20 variables:
##  $ Term                     : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus               : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ BorrowerAPR              : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate             : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ EstimatedReturn          : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.  : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.    : Factor w/ 7 levels "A","AA","B","C",..: NA 1 NA 1 5 3 6 4 2 2 ...
##  $ ProsperScore             : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.: int  0 2 0 16 2 1 1 2 7 7 ...
##  $ Occupation               : Factor w/ 67 levels "Accountant/CPA",..: 36 42 36 51 20 42 49 28 23 23 ...
##  $ EmploymentStatus         : Factor w/ 8 levels "Employed","Full-time",..: 8 1 3 1 1 1 1 1 1 1 ...
##  $ CreditScoreRangeLower    : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper    : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ BankcardUtilization      : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit  : num  1500 10266 NA 30754 695 ...
##  $ DebtToIncomeRatio        : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange              : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ StatedMonthlyIncome      : num  3083 6125 2083 2875 9583 ...
##  $ LoanOriginalAmount       : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ Investors                : int  258 1 41 158 20 1 1 1 1 1 ...

From the structure of the dataset, we can observe:

  1. Term is in months. Expressing it in years will be easier to read.
  2. Listing Category is stored as a numeric code. The category description conveys more information than the code.
  3. Employment Status and Income Range are factors with 8 levels each.
  4. Loan Status is a factor with 12 levels.
  5. Occupation is a factor with 67 levels.

We will convert Term from months to years and store it under new column Term_Yrs and also the Listing Category description will be stored in ListingCategory..alpha. column.

## Term before change:
##   36 36 36 36 36 60
## Term after change:
##   2 2 2 2 2 3
## Unique values in Listing Category:
##  0 2 16 1 7 13 6 15 20 19 3 18 8 4 11 14 5 9 17 10 12
## Variables in data frame:
## 
##  [1] "Term"                      "LoanStatus"               
##  [3] "BorrowerAPR"               "BorrowerRate"             
##  [5] "EstimatedReturn"           "ProsperRating..numeric."  
##  [7] "ProsperRating..Alpha."     "ProsperScore"             
##  [9] "ListingCategory..numeric." "Occupation"               
## [11] "EmploymentStatus"          "CreditScoreRangeLower"    
## [13] "CreditScoreRangeUpper"     "BankcardUtilization"      
## [15] "AvailableBankcardCredit"   "DebtToIncomeRatio"        
## [17] "IncomeRange"               "StatedMonthlyIncome"      
## [19] "LoanOriginalAmount"        "Investors"                
## [21] "Term_Yrs"                  "ListingCategory..alpha."

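We can see that the new columns are now present in our dataset. Note that the “Term after change” values printed above (2 and 3) appear to be the internal level codes of the new factor rather than the term in years; they correspond to the 3-year and 5-year levels. Let’s explore.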
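The two conversions described above can be sketched in base R as follows. This is a minimal illustration on a toy data frame, not the original code: the `loans` object and the `category_labels` lookup vector are names assumed here, and the labels are copied from the ListingCategory list given earlier.

```r
# Toy stand-in for the loan data (values are illustrative only)
loans <- data.frame(
  Term = c(36, 36, 60, 12),
  ListingCategory..numeric. = c(0, 2, 16, 1)
)

# Months to years, stored as an ordered factor with levels 1, 3 and 5
loans$Term_Yrs <- factor(loans$Term / 12, levels = c(1, 3, 5), ordered = TRUE)

# Lookup vector: code 0 maps to the first label, code 20 to the last
category_labels <- c(
  "Not Available", "Debt Consolidation", "Home Improvement", "Business",
  "Personal Loan", "Student Use", "Auto", "Other", "Baby&Adoption", "Boat",
  "Cosmetic Procedure", "Engagement Ring", "Green Loans",
  "Household Expenses", "Large Purchases", "Medical/Dental", "Motorcycle",
  "RV", "Taxes", "Vacation", "Wedding Loans"
)
loans$ListingCategory..alpha. <- factor(
  category_labels[loans$ListingCategory..numeric. + 1],
  levels = category_labels
)

# as.character() shows the term in years: "3" "3" "5" "1"
as.character(loans$Term_Yrs)
# as.integer() shows the internal level codes instead: 2 2 3 1
as.integer(loans$Term_Yrs)
```

Printing a factor with `cat()` or converting it with `as.integer()` exposes the level codes, which is one plausible reason a 36-month term can show up as 2 instead of 3.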

Exploratory Data Analysis

Term:

## Summary by Term:
##     1     3     5 
##  1614 87778 24545

This shows that 3-year loans are by far the most common (87,778 loans), followed by 5-year loans (24,545) and 1-year loans (1,614).

Loan Status:

The plot is self-explanatory. More loans are current (still running) than in any other status. About 38,000 loans have been completed, which is a good sign: these loans were paid back successfully without defaulting. The number of loans in the Charged-off category is large, though. What is a charge-off?

Charged-Off: If your loan becomes more than 120 days past due (you missed the last 5 monthly payments), your loan will be considered “charged-off.” This does not mean your loan has been excused or forgiven. You are still obligated to make payments. When a loan is charged off, the entire balance is accelerated, meaning it is collectible in full as of the charge-off date.
-Prosper

The high number of charged-off loans paints a less favorable picture. Let us check how Loan Status is distributed with respect to Term.

Loan Status By Term:

Clearly the short-term loans (1 year) perform somewhat better, since most of the 1-year loans are completed or in the final payment stage. The number of delinquent loans is much lower than for the other terms.
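Because the three terms differ greatly in loan count, comparing shares within each term is more informative than comparing raw counts. The following is a minimal sketch using a made-up `loans` data frame (the values are illustrative and not taken from the dataset); it builds a Term by Loan Status cross-tabulation and converts each row to proportions.

```r
# Toy data: term in years and loan status (illustrative values only)
loans <- data.frame(
  Term_Yrs = c(1, 1, 3, 3, 3, 5, 5, 3),
  LoanStatus = c("Completed", "Completed", "Current", "Chargedoff",
                 "Completed", "Current", "Current", "Current")
)

# Counts of each status within each term
counts <- table(loans$Term_Yrs, loans$LoanStatus)

# Row-wise proportions: each term's statuses sum to 1
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```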

Prosper Rating:

What is Prosper Rating?

Prosper Rating: It is a proprietary rating figure provided to prospective borrowers based on the company’s estimation of that borrower’s “estimated loss rate”. According to the company, that figure is “determined by two scores: (1) the credit score, obtained from an official credit reporting agency, and (2) the Prosper Score, figured in-house based on the Prosper population.” Prosper Ratings, from lowest-risk to highest-risk, are labeled AA, A, B, C, D, E, and HR (“High Risk”)
-Wiki

Prosper Ratings allow potential investors to easily consider a loan application’s level of risk because the rating represents an estimated average annualized loss rate range to the investor.

The rating category ‘C’ has the most loans. Surprisingly, there are more loans rated “High Risk” than loans rated “AA”, which is the top category.

Prosper Score:

Prosper Score: A custom risk score was built using historical Prosper data to assess the risk of Prosper borrower listings. The output to Prosper users is a Prosper score which ranges from 1 to 11, with 11 being the best, or lowest risk, score. The worst, or highest risk, score, is a 1. Prosper uses both the custom score and the credit reporting agency score together to assess the borrower’s level of risk and determine estimated loss rates. The loss estimates are based on the historical performance of Prosper loans to borrowers with similar characteristics.
-Prosper

The Prosper score estimates the probability of a loan going “bad,” where “bad” is the probability of going 60+ days past due within the first twelve months from the date of loan origination.

Scores of 4, 6 and 8 have more loans than the other scores. In general, scores in the 4-8 range account for more loans than the scores outside that range.

Prosper Rating by Prosper Score

There is a clear pattern that we can see in the above plot. As Prosper Score increases from 1 to 11, the number of higher rated loans increases. There seems to be a positive relationship between Prosper Score and Prosper Rating. Let’s check this out using a boxplot.

The evidence is clear. Prosper Rating and Prosper Score are positively correlated, with a Pearson correlation coefficient of about 0.71. The details are below.

## 
##  Pearson's product-moment correlation
## 
## data:  df$ProsperScore and df$ProsperRating..numeric.
## t = 289.74, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7018231 0.7085876
## sample estimates:
##       cor 
## 0.7052214
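A test like the one above can be reproduced with `cor.test()` from base R. The sketch below uses two short made-up vectors in place of `df$ProsperScore` and `df$ProsperRating..numeric.`. Since both variables are ordinal scores, a rank-based Spearman correlation is a reasonable cross-check; this is an addition for illustration, not something the analysis above reports.

```r
# Toy ordinal data standing in for Prosper Score and numeric Prosper Rating
score  <- c(1, 3, 4, 6, 8, 9, 11)
rating <- c(2, 1, 3, 4, 6, 5, 7)

# Pearson's product-moment correlation (the default method)
pearson <- cor.test(score, rating)

# Spearman's rank correlation, which only uses the ordering of the values
spearman <- cor.test(score, rating, method = "spearman")

pearson$estimate
spearman$estimate
```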

Occupation

It is the occupation stated by the borrower at the time listing was created.

As we saw earlier, there are 67 types of Occupation in our dataset. The most common Occupation specified by borrowers is ‘Other’. Let’s check the number of loans for the most common and the least common occupations. To do this, we will calculate the mean score, median score and count for each occupation and store them in a separate dataframe, which will be used for our analysis.

## Structure of new dataframe:
## # A tibble: 6 x 4
##   Occupation          mean_score median_score     n
##   <fct>                    <dbl>        <dbl> <int>
## 1 Other                     5.76            6 21317
## 2 Professional              6.21            6 10542
## 3 Executive                 6.62            7  3468
## 4 Computer Programmer       6.88            7  3236
## 5 Teacher                   5.86            6  2888
## 6 Analyst                   6.51            7  2735
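The tibble output suggests the summary was built with dplyr. An equivalent summary can be sketched in base R with `tapply()`, shown here on a toy data frame; the `loans` and `by_occ` names and all values are illustrative assumptions.

```r
# Toy data (illustrative values only); one score is missing
loans <- data.frame(
  Occupation = c("Other", "Other", "Teacher", "Analyst", "Teacher", "Other"),
  ProsperScore = c(5, 7, 6, 8, NA, 6)
)

# Drop rows without a score so that the counts match the means
scored <- loans[!is.na(loans$ProsperScore), ]

# One summary statistic per occupation
mean_score   <- tapply(scored$ProsperScore, scored$Occupation, mean)
median_score <- tapply(scored$ProsperScore, scored$Occupation, median)
n            <- tapply(scored$ProsperScore, scored$Occupation, length)

by_occ <- data.frame(
  Occupation   = names(n),
  mean_score   = as.vector(mean_score),
  median_score = as.vector(median_score),
  n            = as.vector(n)
)

# Most common occupations first
by_occ <- by_occ[order(-by_occ$n), ]
by_occ
```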

Prosper Score by Occupation

Does Prosper Score vary by occupation? How do the 10 most common and 10 least common occupations perform in Prosper Score?

The green bar shows the median Prosper Score, while the black dot represents the mean Prosper Score for the same Occupation. Surprisingly, students from technical schools have the highest Prosper Score, while college students have one of the lowest.

Credit Score Range

Credit Score Range is provided as two columns in the dataset.

  • Credit Score Range Lower: The lower limit of the Borrower’s Credit Score provided by a Consumer Credit Rating Agency.
  • Credit Score Range Upper: The upper limit of the Borrower’s Credit Score provided by a Consumer Credit Rating Agency.

If we check the two variables for unique values, we get the following:

## Unique values in Lower Range:
##  [1] 640 680 480 800 740 700 820 760 660 620 720 520 780 600 580 540 560
## [18] 500 840 860  NA 460   0 880 440 420 360
## Unique values in Upper Range:
##  [1] 659 699 499 819 759 719 839 779 679 639 739 539 799 619 599 559 579
## [18] 519 859 879  NA 479  19 899 459 439 379

Since each variable takes only a small set of distinct values, we will convert both to factors and plot them.

From the plots, we can see that the lower and upper scores follow the same distribution. The upper value of each range is 19 points above the lower value (for example, 640 to 659), so each range covers 20 score values. This is expected: a borrower with a high lower bound will also have a high upper bound. This is evident from the plot below.
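The fixed width of the ranges can be checked directly by subtracting the two columns. This is a small sketch on a few lower and upper pairs taken from the unique values listed above; the vector names are illustrative.

```r
# A few lower/upper pairs from the unique values shown above
lower <- c(640, 680, 480, 800, NA, 0)
upper <- c(659, 699, 499, 819, NA, 19)

# Width of each range; NA rows are dropped before taking unique values
width <- upper - lower
unique(width[!is.na(width)])   # 19

# Converting to a factor keeps the distinct ranges in increasing order
lower_f <- factor(lower)
levels(lower_f)
```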

Prosper Rating by Credit Score Range

Clearly, with increase in Prosper Rating, there is increase in Credit Score.

Prosper Score by Credit Score Range

Since we know that Prosper Rating and Prosper Score are positively correlated, we can expect a result similar to the previous one. Let’s see.

Yes. With increase in Prosper Score, the Credit Score also increases.

Bank Card Utilization

Bank Card Utilization: Credit card utilization - or just credit utilization, for short - refers to how much of your available credit you use at any given time. The credit utilization ratio is a component used by credit reporting agencies in calculating a borrower’s credit score. Lowering the credit utilization ratio can help a borrower to improve their credit score.
-Investopedia

More borrowers fall at the high end of the Card Utilization rate, meaning many borrowers are using their available credit close to its limit.

Card Utilization by Prosper Score

By definition, high Card Utilization has a negative influence on a borrower’s credit score and rating. We will plot the mean and median Card Utilization values for each Prosper Score. We would expect a negative relationship.

Indeed, the relationship is negative. As Prosper Score increases, the Card Utilization rate decreases. But we should also consider other hidden factors. The total available credit, which forms the base of the utilization ratio, varies between borrowers, and this could influence the graph.

Available Credit

Available Credit: Your available credit is the amount you can use for purchases at any given time. It is the difference between your total credit limit and your account balance.
-Lendedu

## Summary on Available Credit:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0     880    4100   11210   13180  646285    7544
## Available Credit at 99% percentile:
##      99% 
## 89005.32

There is a huge gap between the maximum (646,285) and the third quartile (13,180). The 99th percentile is about 89,005, so 99% of the values lie at or below that amount. We will use this cutoff to filter the data and plot only the lower 99% of Available Credit values.
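The percentile filter can be sketched with `quantile()` from base R. The `credit` vector below is a toy example built from the summary values above, not the real column, and `cutoff` and `trimmed` are names chosen for illustration.

```r
# Toy values, including one extreme value and one missing value
credit <- c(0, 880, 4100, 13180, 646285, NA)

# 99th percentile, ignoring missing values
cutoff <- quantile(credit, probs = 0.99, na.rm = TRUE)

# Keep only non-missing values at or below the cutoff
trimmed <- credit[!is.na(credit) & credit <= cutoff]
trimmed
```

With only five non-missing values, the interpolated cutoff falls between the two largest ones, so the single extreme value is removed and the rest are kept.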

Most borrowers have a relatively low Available Credit. From the summary we know that the median value is 4,100.

Available Credit by Prosper Score

What is the relationship between Available Credit and Prosper Score?

As Prosper Score increases, Available Credit also increases. This is consistent with the utilization result: borrowers with a better Prosper Score tend to use a smaller share of their total credit limit, and hence their Available Credit is higher.

Available Credit & Card Utilization by Prosper Score

We will try plotting the relationship between Available Credit and Card Utilization Ratio along with Prosper Score. Let us see if we can find any relationship.

From the plot, we can see that the values are dispersed, and there is no clear relationship pattern between them. However, borrowers with a higher Prosper Score have a higher Available Credit than borrowers with a lower score, who in general have a lower Available Credit balance. The line shows the median Available Credit for each Card Utilization value.

Debt-To-Income Ratio

Debt-To-Income: The debt-to-income (DTI) ratio is a personal finance measure that compares an individual’s debt payment to his or her overall income. The debt-to-income ratio is one way lenders, including mortgage lenders, measure an individual’s ability to manage monthly payment and repay debts. DTI is calculated by dividing total recurring monthly debt by gross monthly income, and it is expressed as a percentage. A low debt-to-income (DTI) ratio demonstrates a good balance between debt and income. Conversely, a high DTI can signal that an individual has too much debt for the amount of income he or she has.
-Investopedia

Summary of DTI in our dataset

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1400  0.2200  0.2759  0.3200 10.0100

We will plot the variable.

Few borrowers have a DTI greater than 1. Most borrowers have a DTI value below 1, and the middle half lie between 0.14 and 0.32. For our analysis we will concentrate on DTI values below 1.

Debt-To-Income by Prosper Score

We will study the relationship between DTI and Prosper Score. We will calculate the median DTI value for each score and plot them.

## Median DTI values for Prosper Scores
## # A tibble: 11 x 2
##    ProsperScore median_dir
##           <dbl>      <dbl>
##  1            1       0.33
##  2            2       0.27
##  3            3       0.28
##  4            4       0.27
##  5            5       0.25
##  6            6       0.24
##  7            7       0.21
##  8            8       0.19
##  9            9       0.17
## 10           10       0.16
## 11           11       0.19

The median DTI value generally decreases as Prosper Score increases, apart from a small rise from score 2 (0.27) to score 3 (0.28). Interestingly, for the highest score, 11, the median DTI (0.19) is greater than the median DTI for scores 9 and 10.

The spread of DTI appears to shrink as Prosper Score rises. At lower scores, the DTI values are more spread out in the graph. As the score increases, the values become more concentrated, with less spread.

Income Range

It is the income range of the borrower. The available income ranges are

## Income Ranges in dataset:
##   $0 $1-24,999 $100,000+ $25,000-49,999 $50,000-74,999 $75,000-99,999 Not displayed Not employed

Let’s plot it out!

Most borrowers are in the income range of $25,000 to $75,000. Also the number of borrowers from $100,000+ category seems to be slightly higher than $75,000 to $99,999 category.

Income Range by Term

We have already seen that 3-year loans are the most common, with 5-year loans in second place. The same is reflected here. Looking at how the terms are distributed, the $1-$24,999 category has very few 1-year loans in comparison to the other income categories. The $50,000-$74,999 category has more 5-year loans than the other categories.

Income Range by Prosper Score

How is the income range distributed across Prosper Score? Do higher Prosper Scores have more borrowers from a particular income group? Let’s find out.

Yes. At lower scores, the lower income ranges have more loans than the higher income ranges, meaning that loans with a low Prosper Score come mostly from lower-income borrowers. As the score increases, the higher income groups gradually take the lead.

Debt-To-Income by Income Range

How about Debt-To-Income ratio and Income Range? Do lower income range borrowers have higher DTI values?

The $1 to $24,999 income range has the most borrowers with a DTI value greater than 1. Borrowers who have not displayed their income range also include a number with high DTI values. For higher income ranges, the DTI values are concentrated below 1.

Stated Monthly Income

The monthly income stated by the borrower. Nearly 75% of borrowers have a monthly income less than $7,000. The maximum value is about $1.75 million, which seems hard to believe. It could be an outlier or an error. We will not go into it now.

## Summary of Stated Monthly Income
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750003

We will plot the variable.

99% of borrowers have a stated monthly income below $20,530. Most borrowers seem to have a monthly income between $3,000 and $5,000.

Stated Monthly Income by Income Range

In order to find the relationship, we will create a new dataframe with mean, median and count for Stated Monthly Income grouped over Income Range.

## Structure of new dataframe:
## # A tibble: 6 x 4
##   IncomeRange     mean_income median_income     n
##   <ord>                 <dbl>         <dbl> <int>
## 1 $0                 0.000134            0    621
## 2 $1-24,999       1428.               1583.  7274
## 3 $25,000-49,999  3131.               3167. 32192
## 4 $50,000-74,999  5025.               5000  31050
## 5 $75,000-99,999  7057.               7000  16916
## 6 $100,000+      12446.              10417. 17337

We will explore the relationship between income group and stated monthly income. There should be a high correlation between the variables, and the chart shows that. The red line shows the median Stated Monthly Income and the black line shows the mean. Stated Monthly Income increases with Income Range.
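The `<ord>` column type in the summary above indicates that IncomeRange was converted to an ordered factor. By default, R sorts factor levels alphabetically, which places `$100,000+` before `$25,000-49,999`. The following is a minimal sketch of setting the order explicitly; placing the two non-numeric levels last is an assumption made here for illustration, and `income` and `ordered_levels` are made-up names.

```r
# Toy values (illustrative only)
income <- c("$100,000+", "$25,000-49,999", "$0", "$1-24,999", "Not employed")

# Explicit level order; position of the last two levels is an assumption
ordered_levels <- c(
  "$0", "$1-24,999", "$25,000-49,999", "$50,000-74,999",
  "$75,000-99,999", "$100,000+", "Not displayed", "Not employed"
)

income_f <- factor(income, levels = ordered_levels, ordered = TRUE)

# Comparisons now follow income order rather than alphabetical order
income_f[1] > income_f[2]   # TRUE: $100,000+ ranks above $25,000-49,999
sort(income_f)
```

With the levels ordered this way, grouped summaries and plots list the income ranges from lowest to highest.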

Listing Category

The listing category selected by the borrower, which specifies the reason for taking the loan. There are 21 categories, coded 0 to 20.

The most popular category is Debt Consolidation.

Listing Category by Income Range

We have plotted the distribution of Listing Category by Income Range. The pattern is similar for all income groups except $0. In the $0 income group, the most popular category is Business, followed by Personal Loan.

Employment Status

Most of the borrowers are in the Employed category (working multiple jobs) or are full-time employees.

Employment Status by Term

Do people of certain employment prefer a certain term loan?

Borrowers with Full-Time employment mostly prefer a loan with 3 year term. 5 year term loans are preferred by borrowers in Employed category.

Employment Status by Listing Category

A surprise here is that borrowers with an employment status of Employed have taken loans in every category except Personal Loan and Student Use.

Borrower APR

Borrower APR: Annual Percentage Rate (APR) is the cost of credit as a yearly rate. The APR figures in not just your interest rate, but also some fees associated with your loan over its lifetime. At Prosper, this means the closing fee charged when you first borrow the money. This closing fee is paid out of the loan proceeds when the loan originates.
-Prosper

The APR value of about 0.36 has the highest count. This is surprising, because it shows up as an isolated spike in an otherwise smooth distribution.

Borrower APR by Loan Status

The number of charged-off loans is highest at the Borrower APR value of 0.36, higher than at any other APR value. The number of current loans is highest for APR values of 0.18 and 0.20.

Borrower APR by Income Range

What is the average APR for each income group ?

The APR decreases as income range increases. The “Not employed” category, however, has a high APR, higher than the other groups. The $0 income range has a comparatively low APR. We saw previously that for the $0 income range the most popular Listing Category was Business, followed by Personal Loan, unlike the other income ranges. This suggests that many borrowers in this income range are business owners, which could also explain the lower rate charged on their loans.

Borrower APR by Prosper Score

How is the Borrower APR distributed across Prosper Scores?

From the plot, there is a clear pattern. With increase in Prosper Score, the Borrower APR decreases. We will also check this out using boxplot.

Yes. The relationship is clear. There is a fairly strong negative correlation (about -0.67) between Prosper Score and Borrower APR. The correlation details are below.

## 
##  Pearson's product-moment correlation
## 
## data:  df$ProsperScore and df$BorrowerAPR
## t = -261.68, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6719940 -0.6645469
## sample estimates:
##        cor 
## -0.6682872

Borrower APR by Prosper Score & Term

Now that we see the relationship, we will also add in the variable Term (in years)

There are a few things we can notice in this plot. The rate of change or slope of the median Borrower APR values in each Term is different. The slope decreases with increase in Term, meaning the median APR value changes more in 1 year loans than in 5 year loans. Also we see that there are no 1 year loans with Prosper Score 11.

Borrower Rate

Borrower Rate: Borrower/Interest rate is the amount charged, expressed as a percentage of principal, by a lender to a borrower for the use of assets. The assets borrowed could include cash, consumer goods, and large assets such as a vehicle or building.
-Investopedia

Borrower Rate vs. Borrower APR:

The Interest rate refers to the annual cost of a loan to a borrower and is expressed as a percentage. The interest rate does not include fees charged for the loan. The APR (Annual Percentage Rate) reflects the annual cost of a loan to a borrower including any fees charged to originate the loan.
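To make the difference concrete, the sketch below computes the level monthly payment of a fully amortized loan and then solves for the rate at which the same payments repay only the amount actually received (principal minus an upfront fee). This is a simplified illustration, not Prosper’s actual calculation: the function names, the 5% fee and the loan figures are all assumptions, and the rates are nominal annual rates with monthly compounding.

```r
# Level monthly payment on a fully amortized loan
monthly_payment <- function(principal, annual_rate, n_months) {
  r <- annual_rate / 12
  principal * r / (1 - (1 + r)^(-n_months))
}

# Simplified APR: the annual rate at which the same payments would repay
# the net amount received (principal minus an upfront fee, fee > 0)
apr_with_fee <- function(principal, annual_rate, n_months, fee) {
  pmt <- monthly_payment(principal, annual_rate, n_months)
  net <- principal - fee
  gap <- function(apr) monthly_payment(net, apr, n_months) - pmt
  uniroot(gap, interval = c(annual_rate, 1), tol = 1e-10)$root
}

# Hypothetical loan: $10,000 at a 15% borrower rate over 36 months, $500 fee
pmt <- monthly_payment(10000, 0.15, 36)
apr <- apr_with_fee(10000, 0.15, 36, fee = 500)
round(pmt, 2)
apr > 0.15   # TRUE: the fee pushes the APR above the borrower rate
```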

The distributions of Borrower APR and Borrower Rate are similar. Borrower Rate has a spike near 0.32, comparable to the spike near 0.36 in Borrower APR.

Borrower Rate by Term

Borrower Rate increases with Term. This is surprising! Other factors may influence this change, such as the loan amount, Prosper Score, Prosper Rating or Listing Category.

Borrower Rate by Listing Category

There is variation in Interest Rate with change in category. Loans for Cosmetic Procedures have the highest Borrower Rates. Then comes Household Expenses loans.

Borrower Rate vs Borrower APR

There is a high correlation between Borrower Rate and Borrower APR. This makes sense, since Borrower APR is the Borrower Rate plus other charges.

Borrower Rate vs Borrower APR by Term

We have plotted the relationship between Borrower Rate & Borrower APR for each Loan Term. The black lines are the best fit regression lines. The correlation results are below.

## 
##  Pearson's product-moment correlation
## 
## data:  df$BorrowerAPR and df$BorrowerRate
## t = 2347.7, df = 113910, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9897057 0.9899409
## sample estimates:
##      cor 
## 0.989824

Estimated Return

Estimated Return: Estimated Return is the difference between the Estimated Effective Yield and the Estimated Loss Rate. Estimated Effective Yield is equal to the borrower interest rate: (i) minus the servicing fee rate, (ii) minus estimated uncollected interest on charge-offs, (iii) plus estimated collected late fees. The Estimated Loss Rate is the estimated principal loss on charge-offs, and is based on the historical performance of Prosper loans. All estimates are based on the historical performance of Prosper loans for borrowers with similar characteristics. The calculations of Estimated Return, Estimated Effective Yield, and Estimated Loss Rate require significant assumptions about the repayment of loans, and investors should make their own judgments with respect to the accuracy of these assumptions. Actual performance may differ from estimated performance.
-Prosper
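The definition above is simple arithmetic, sketched here as a small function. The function name, argument names and input figures are hypothetical and chosen only to illustrate the formula; they are not Prosper’s numbers.

```r
# Estimated Return = Estimated Effective Yield - Estimated Loss Rate
estimated_return <- function(borrower_rate, servicing_fee,
                             uncollected_interest, late_fees, loss_rate) {
  # Effective yield: rate minus servicing fee and uncollected interest,
  # plus collected late fees
  effective_yield <- borrower_rate - servicing_fee -
    uncollected_interest + late_fees
  effective_yield - loss_rate
}

# Hypothetical higher-rate loan: yield 0.177, loss rate 0.09
typical <- estimated_return(0.20, 0.01, 0.015, 0.002, 0.09)

# Hypothetical loan whose expected losses exceed its yield
negative <- estimated_return(0.10, 0.01, 0.01, 0.001, 0.15)

c(typical = typical, negative = negative)
```

The second call shows how an estimated return can turn negative: the estimated loss rate is larger than the effective yield.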

The estimated returns are negative in some cases. Typical estimated returns are around 6-12%.

Estimated Return by Term

We have significant outliers for 3 year loans. They extend in the negative too. At first sight, we notice that the estimated return is less for 1 year term loans. With increase in Term, there is increase in return.

Estimated Return vs Borrower Rate

This is an important comparison, because Borrower Rate is the interest charged on the loan, which is ultimately paid by the borrower, while Estimated Return is the return on investment that the investor can expect from the loan. If there is a relationship between the two, it could guide investors’ decisions. Let’s check it out.

Looking at the graph, we can notice clear hidden patterns: straight lines with different slopes, layered on top of one another. What factor could give us clarity? Let’s begin with the loan duration, or Term.

Estimated Return vs Borrower Rate by Term

Aha! It’s clear now that the duration of the loan plays a major role in this comparison. The rate of change, or slope, is different in each case. As Borrower Rate increases, the estimated return rises most for 3-year loans. This may help explain why there are more loans in the 3-year category. There could be other reasons, but this is one of the influences.

Hmm..even now, we can see that in all 3 Loan Terms, there are straight lines of varying slope layered upon one another. We need to dig deeper! We will compute the relationship by further splitting it using Prosper Rating.

Estimated Return vs Borrower Rate by Prosper Rating & Term

Whew! The plot has a lot of information to process.

  • The Estimated Return for 3-year loans with a “High Risk” rating follows several distinct return lines. There should be some underlying factor behind this clear difference.
  • 3-year loans in general have a wider range, i.e. the difference between the minimum and maximum values in each case is larger than for the other terms.
  • As the Prosper Rating improves, the Borrower Rate falls and so does the return rate.
  • We did not find any 1-year or 5-year loans with a “High Risk” Prosper Rating.

Investors

After looking at Interest Rate and Estimated Return, we should look into Investors.

The graph is skewed with most of the values to the left, meaning the number of investors for each loan is less than 300 for most cases. We’ll look at the summary for this variable.

## Summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00   44.00   80.48  115.00 1189.00

One can see that 75% of loans have 115 or fewer investors. The first quartile is only 2, so a large share of loans had just 1 or 2 investors at the time of listing. We will filter and analyze loans with more than 1 investor, in order to identify the pattern of the distribution more clearly.

Investors by Prosper Score

Borrowers with a higher Prosper Score attract more investors. For a score of 11, however, this pattern seems to be broken. This could be because there are fewer loans with a Prosper Score of 11, perhaps because the loan amounts were smaller and so fewer investors were needed to fund them, or for one of various other reasons.

Investors by Estimated Return

Most investors have invested in loans with an estimated return greater than 0. However, we can also find investments in loans with negative estimated returns, which could lead to a loss on the investment. The number of investors peaks in the range 0.05-0.11, which shows that most investors prefer loans where the estimated return is 5-11%. Higher returns come with more risk, and so the number of investors drops off.

Investors by Prosper Rating and Term

The pattern observed here is that the number of investors increases as Prosper Rating improves, from fewer than 200 (rating HR) to more than 1000 (rating AA).

Loan Amount

This is the amount loaned to the borrower.

We can see irregular spikes in the plot. That is because people generally request round amounts such as 10,000 or 15,000 when they apply for a loan. We have spikes at 4,000, 10,000, 15,000, 20,000 and 25,000.
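
A quick way to confirm the spikes without reading them off the histogram is to count how often each exact amount occurs. The sketch below uses a short invented vector of amounts; on the real data the same calls would be applied to the loan amount column.

```r
# Hypothetical loan amounts, invented for illustration.
amounts <- c(4000, 4000, 4000, 10000, 10000,
             15000, 15000, 15000, 15000, 7350, 12200)

# Frequency of each distinct amount, most common first.
counts <- sort(table(amounts), decreasing = TRUE)
print(head(counts, 3))

# Share of loans requested in whole thousands.
round_share <- mean(amounts %% 1000 == 0)
print(round(round_share, 2))
```

A high share of whole-thousand amounts would support the explanation that borrowers tend to ask for round figures.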

Loan Amount vs Investors

We would expect the number of investors to increase as Loan Amount increases, and that is what we see in the plot. For Loan Amounts below 10,000, however, the distribution is not clear. This could be because of the variety of amounts requested and because more investors fund those loans. The second plot zooms in on amounts < 10,000 to show this.

Loan Amount vs Investors by Term

From the plot, we can see that below 10,000 there are more loans with a Term of 3 years. These loans also have more investors.

Loan Amount vs Investors by Term and Prosper Rating

For “High Risk” loans, both the number of Investors and the Loan Amount are lower. As Prosper Rating improves, both Loan Amount and the number of Investors increase.

Final Plots and Summary


Plot One

In this plot, we examine the relationship between Prosper Score and Prosper Rating. Prosper Score estimates the probability of a loan going “bad,” where “bad” means going 60+ days past due within the first twelve months from the date of loan origination. Prosper Score ranges from 0 (worst) to 11 (best).

Prosper Rating is an internal rating assigned by Prosper that represents an estimated average annualized loss rate range. Ratings run from AA (best) to HR (worst).

From the plot, we can see that as Prosper Score increases, the number of higher rated loans increases. For scores of 1 and 2, most loans are in the HR and E categories. For score 3, most loans are in the D category. For scores 4 and 5, ‘C’ category loans dominate. From score 6 on, we can see higher rated loans (A, AA) steadily grow and dominate the graph. Although higher rated loans are already present at scores like 6, 7 and 8, the number of such loans is small in comparison. The Pearson correlation between Prosper Score and Prosper Rating is r = 0.705.
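
Because Prosper Rating is a letter grade, it has to be mapped to numbers before a Pearson correlation can be computed. The sketch below shows one way to do that, assuming the HR = 1 up to AA = 7 ordering of the numeric rating; the data frame and its column names are invented for illustration, so the resulting value is not the r reported above.

```r
# Ordering assumed from the numeric rating: HR is lowest, AA is highest.
rating_levels <- c("HR", "E", "D", "C", "B", "A", "AA")

# Hypothetical loans, invented for illustration.
toy <- data.frame(
  ProsperScore = c(1, 2, 3, 4, 5, 6, 8, 9, 10, 11),
  ProsperRatingAlpha = c("HR", "E", "D", "C", "C", "B", "A", "B", "AA", "AA")
)

# Convert the letter grade to its position in the ordering (1 to 7).
toy$RatingNumeric <- as.integer(
  factor(toy$ProsperRatingAlpha, levels = rating_levels)
)

# Pearson correlation between score and numeric rating.
r <- cor(toy$ProsperScore, toy$RatingNumeric, method = "pearson")
print(round(r, 3))
```

Since the rating is ordinal, passing method = "spearman" to cor() would be a reasonable cross-check on the same columns.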

Plot Two

Available Credit is the amount of credit the borrower can still use. It is the balance remaining on the card after the borrower has made some purchases, i.e. the difference between Total Credit and Credit Balance. Bank Card Utilization is the percentage of Total Credit the borrower has used.

The dataset describes borrowers at a specific point in time. If a listing were published today, the results might or might not match what we see in our dataset. However, in our dataset of more than 113K records, we find that as Prosper Score increases, Bank Card Utilization goes down. The utilization rate for score 11 seems to break the trend, but the margin is not large. At the same time, Available Bank Credit increases with score. This increase in available credit may be due to lower utilization, but I’m sure that is not the only factor. We saw in our analysis that as Prosper Score increases, more borrowers come from higher income ranges. It is possible that these borrowers increased their credit limits, which in turn reduced utilization and increased available credit. This is just an assumption made to identify the underlying factors of influence.
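
The per-score trend described here can be checked by summarising both variables for each Prosper Score. The following is a minimal sketch on invented values; the column names BankcardUtilization and AvailableBankcardCredit are assumptions, and the median is just one reasonable choice of summary.

```r
# Hypothetical borrowers, invented for illustration.
toy <- data.frame(
  ProsperScore = c(2, 2, 5, 5, 9, 9),
  BankcardUtilization = c(0.90, 0.80, 0.60, 0.50, 0.30, 0.20),
  AvailableBankcardCredit = c(500, 900, 3000, 4000, 12000, 15000)
)

# Median utilization and median available credit for each score.
by_score <- aggregate(
  cbind(BankcardUtilization, AvailableBankcardCredit) ~ ProsperScore,
  data = toy,
  FUN = median
)
print(by_score)
```

The median is used here because credit amounts are usually skewed; replacing FUN with mean gives the average instead.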

Plot Three

I could not end this project without this correlation matrix. It is a storehouse of information, all displayed in a single plot. Both extreme correlation values, 1 and -1, appear in the matrix.

  • Borrower Rate (interest rate charged for the loan) and Borrower APR (interest rate + applicable charges) have a correlation score of 1.
  • Credit Score Range - Upper and Lower values have a correlation score of 1.

Although these correlation scores have little practical significance once we know the meaning of the variables, it is still useful to confirm the relationships statistically. Beyond the relationships above, there are a few more that I would like to explore.

  • Prosper Rating and Estimated Return have a correlation of -0.7. This is significant because it says that as the Rating of a loan improves, the expected return for the investor decreases. I think we should consider one more relationship in order to fully grasp this finding. Prosper Rating and Borrower Rate have a correlation of -1, and so do Prosper Rating and Borrower APR. What does this mean?

    This means that borrowers with well rated loans are charged less interest and, ultimately, when less interest is charged, the return on investment is also lower. Furthermore, Estimated Return and Borrower Rate have a correlation of 0.8, meaning that returns tend to rise with the interest rate charged. This supports the finding above.
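
A correlation matrix like this one can be produced with cor() on the numeric columns. The sketch below runs on simulated values, so the numbers it prints are not the ones discussed above; the column names and the way the toy columns are generated are assumptions made for illustration. Missing values are handled pairwise, and the result is rounded to two decimals for display, which is worth keeping in mind because a very strong correlation can round to exactly 1 or -1.

```r
# Simulated data: column names are assumed, values are invented.
set.seed(42)
n <- 200
rate <- runif(n, min = 0.05, max = 0.35)

toy <- data.frame(
  BorrowerRate = rate,
  BorrowerAPR = rate + 0.02,                        # assumed flat fee add-on
  EstimatedReturn = 0.5 * rate + rnorm(n, sd = 0.02),
  RatingNumeric = round(7 - 20 * (rate - 0.05))     # lower rate, higher rating
)

# Introduce a couple of missing values, as in the real dataset.
toy$EstimatedReturn[c(3, 17)] <- NA

# Pairwise-complete Pearson correlations, rounded for display.
corr <- round(cor(toy, use = "pairwise.complete.obs"), 2)
print(corr)
```

With use = "pairwise.complete.obs", each pair of columns uses every row where both values are present, so a few NA values in one column do not discard the whole row for the other pairs.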

Reflection


The Prosper dataset contains over 113k observations with 81 variables. The first obstacles in analyzing the dataset were understanding the variables, the terminology and the general domain of peer-to-peer lending. Even after that, it was tough to determine which variables to analyze, to avoid drifting too far down any one path of investigation, and to avoid pulling in new variables throughout the process.

However, success was found in many areas. The general analysis revealed areas of interest, such as the positive correlation between borrower rate and return rate, which brought up many questions concerning investment, loan rating and the difficulty of separating cause and effect. Also, trends were confirmed, and unexpected, unknown relationships were revealed.

Additions to this dataset would strengthen this analysis. Time series data for individual borrowers would be more useful for identifying patterns in usage, borrowing, repayment, defaulting, etc. I also found many ‘NA’ values in the current dataset. With fewer ‘NA’ values, this analysis could reveal even more about the dataset.
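
To see how much the ‘NA’ values affect each variable, the missing entries can be counted per column. This is a small sketch on an invented data frame; the column names are assumptions, and on the real data the same calls would be applied to the full loans data frame.

```r
# Hypothetical data frame with some missing values, invented for illustration.
toy <- data.frame(
  ProsperScore = c(4, NA, 7, NA, 10),
  EstimatedReturn = c(0.08, NA, 0.06, 0.11, NA),
  Term = c(36, 36, 60, 12, 36)
)

# Number and percentage of missing values per column.
na_count <- colSums(is.na(toy))
na_share <- round(100 * na_count / nrow(toy), 1)
print(data.frame(na_count, na_share))

# Rows with no missing value in any column.
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```

Columns with a large share of missing values are the ones where conclusions drawn from the remaining rows deserve the most caution.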